Glider
"Past results offer no guarantees for the future"
About this blog

These are the ramblings of Matthijs Kooijman, concerning the software he hacks on, hobbies he has and occasionally his personal life.

Most content on this site is licensed under the WTFPL, version 2 (details).


Disabling (broken) sensors in Supermicro IPMI to prevent alarm beeps

A few months ago, I put up an old Atom-powered Supermicro server (SYS-5015A-PHF) again, to serve at De War to collect and display various sensor and energy data about our building.

The server turned out to have an annoying habit: every now and then it would start beeping (one continuous, annoying beep) and keep going until the machine was rebooted. It happened sporadically, but kept coming back. When I used this machine before, it was located in a datacenter where nobody would care about a beep more or less (so maybe it had been beeping for years on end before I replaced the server), but now it was in a server cabinet inside our local Fablab, where there are plenty of people to become annoyed by a beeping server...

I eventually traced this back to faulty sensor readings and fixed this by disabling the faulty sensors completely in the server's IPMI unit, which will hopefully prevent the annoying beep. In this post, I'll share my steps, in case anyone needs to do the same.


Shorting fan pins

At first, I noticed that the IPMI web interface displayed an alarm for one of the fans. Of course it makes sense to be notified of a faulty fan, except that the system did not have any fans connected... It did show the fan speed as 0 RPM (or -2560 RPM, depending on where you looked) as expected. So I suspected that the system would start up realizing there was no fan, but then sporadically see a bit of electrical noise on the fan speed pin, causing it to mark the fan as present and immediately as not running, which triggered the alarm. I tried to fix this by shorting the fan speed detection pins to the GND pins to make them more noise-resilient.

Temperatures also wonky

However, a couple of weeks later, the server started beeping again. This time I looked a bit more closely, and found that the alarm was now caused by a temperature reading that was too high. The IPMI system event log (queried using ipmi-sel) showed:

43   | Feb-17-2023 | 09:18:58 | CPU Temp         | Temperature       | Upper Non-critical - going high ; Sensor Reading = 125.00 C ; Threshold = 85.00 C
44   | Feb-17-2023 | 09:18:58 | CPU Temp         | Temperature       | Upper Critical - going high ; Sensor Reading = 125.00 C ; Threshold = 90.00 C
45   | Feb-17-2023 | 09:18:58 | CPU Temp         | Temperature       | Upper Non-recoverable - going high ; Sensor Reading = 125.00 C ; Threshold = 95.00 C
46   | Feb-17-2023 | 16:26:16 | CPU Temp         | Temperature       | Upper Non-recoverable - going high ; Sensor Reading = 41.00 C ; Threshold = 95.00 C
47   | Feb-17-2023 | 16:26:16 | CPU Temp         | Temperature       | Upper Critical - going high ; Sensor Reading = 41.00 C ; Threshold = 90.00 C
48   | Feb-17-2023 | 16:26:16 | CPU Temp         | Temperature       | Upper Non-critical - going high ; Sensor Reading = 41.00 C ; Threshold = 85.00 C

This is a bit opaque, but the events at 09:18 show the temperature was read as 125°C - clearly indicating a faulty sensor. These are (I presume) the "asserted" events for each of the thresholds that this sensor has. Then at 16:26, the server was rebooted and the sensor read 41°C again (which I believe is still higher than realistic) and each of the thresholds emitted a "deasserted" event.
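To pick such impossible readings out of a longer log, the event lines can be filtered on the "Sensor Reading" value. What follows is only a minimal sketch, assuming the pipe-separated layout shown above; the function name suspicious_readings and the 100°C cut-off in the usage line are made up for this example and are not part of freeipmi.

```shell
# suspicious_readings MAX
# Reads ipmi-sel style lines on stdin and prints "date time sensor reading"
# for every temperature event whose reported reading is above MAX.
# (Hypothetical helper; assumes the column layout shown in the log above.)
suspicious_readings() {
    awk -F'|' -v max="$1" '
        $5 ~ /Temperature/ && match($6, /Sensor Reading = -?[0-9.]+/) {
            # "Sensor Reading = " is 17 characters; the number follows it
            reading = substr($6, RSTART + 17, RLENGTH - 17) + 0
            if (reading > max) {
                gsub(/^ +| +$/, "", $4)              # trim padded sensor name
                gsub(/ /, "", $2); gsub(/ /, "", $3) # trim date and time
                print $2, $3, $4, reading
            }
        }
    '
}

# Possible usage (the limit of 100 is an arbitrary assumption):
#   sudo ipmi-sel | suspicious_readings 100
```

Any limit above what the hardware could plausibly survive will do here; the point is only to separate sensor glitches from real overheating.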

Looking back, I noticed that the log showed events for both fans and both temperature sensors, so it seemed all of these sensors were really wonky. I could also see the incorrect temperatures clearly in the sensor data I had been collecting from the server (gathered by telegraf through lm-sensors from within the Linux system itself, but clearly reading from the same sensor as IPMI):

Graph of erratic temperatures

Note that the graph above shows three sensors, while IPMI only reads two, so I am not sure what the third one is. The alarm from the IPMI log is shown clearly as a sudden jump of the purple temp2 line (jumping back down when the server was rebooted). But also note an unexplained second jump down a few hours later, and note that the next day temp1 dives down to -53°C for some reason, which also matches what IPMI reads:

$ sudo ipmitool sensor                                                
System Temp      | -53.000    | degrees C  | nr    | -9.000    | -7.000    | -5.000    | 75.000    | 77.000    | 79.000    
CPU Temp         | 27.000     | degrees C  | ok    | -11.000   | -8.000    | -5.000    | 85.000    | 90.000    | 95.000    
CPU FAN          | -2560.000  | RPM        | nr    | 400.000   | 585.000   | 770.000   | 29260.000 | 29815.000 | 30370.000 
SYS FAN          | -2560.000  | RPM        | nr    | 400.000   | 585.000   | 770.000   | 29260.000 | 29815.000 | 30370.000 
CPU Vcore        | 1.160      | Volts      | ok    | 0.640     | 0.664     | 0.688     | 1.344     | 1.408     | 1.472     
Vichcore         | 1.040      | Volts      | ok    | 0.808     | 0.824     | 0.840     | 1.160     | 1.176     | 1.192     
+3.3VCC          | 3.280      | Volts      | ok    | 2.816     | 2.880     | 2.944     | 3.584     | 3.648     | 3.712     
VDIMM            | 1.824      | Volts      | ok    | 1.448     | 1.480     | 1.512     | 1.960     | 1.992     | 2.024     
+5 V             | 5.056      | Volts      | ok    | 4.096     | 4.320     | 4.576     | 5.344     | 5.600     | 5.632     
+12 V            | 11.904     | Volts      | ok    | 10.368    | 10.496    | 10.752    | 12.928    | 13.056    | 13.312    
+3.3VSB          | 3.296      | Volts      | ok    | 2.816     | 2.880     | 2.944     | 3.584     | 3.648     | 3.712     
VBAT             | 2.912      | Volts      | ok    | 2.560     | 2.624     | 2.688     | 3.328     | 3.392     | 3.456     
Chassis Intru    | 0x0        | discrete   | 0x0000| na        | na        | na        | na        | na        | na        
PS Status        | 0x1        | discrete   | 0x01ff| na        | na        | na        | na        | na        | na

Note that the voltage sensors show readings that do make sense, and looking at the history, they show no sudden jumps, so those are probably still reliable (even though they are read from the same sensor chip according to lm-sensors).
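The table above can also be screened mechanically for sensors that the BMC has given up on. This is a small sketch, assuming the pipe-separated layout shown above and that the fourth column is the status (with nr presumably being ipmitool's abbreviation for "non-recoverable"). The function name unreadable_sensors is invented for this example.

```shell
# unreadable_sensors
# Reads `ipmitool sensor` style output on stdin and prints the name, value
# and unit of every sensor whose status column is "nr".
# (Hypothetical helper; assumes status is the fourth pipe-separated column.)
unreadable_sensors() {
    awk -F'|' '
        {
            # strip the padding around every column
            for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
        }
        $4 == "nr" { print $1 ": " $2 " " $3 }
    '
}

# Possible usage:
#   sudo ipmitool sensor | unreadable_sensors
```

With the output above, this would list System Temp, CPU FAN and SYS FAN, which are exactly the sensors that get disabled below.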

Disabling the sensors

It seems you can disable generation of events when a threshold is crossed, and can even disable reading the sensor entirely. Hopefully this will also prevent the BMC from beeping on weird sensor values.

To disable things, I used ipmi-sensors-config (from the freeipmi-tools Debian package):

  1. First I queried the current sensor configuration:

     sudo ipmi-sensors-config --checkout > ipmi-sensors-config.txt
    
  2. Then I edited the generated file, setting Enable_All_Event_Messages and Enable_Scanning_On_This_Sensor to No. I also had to set the hysteresis values for the fans to None, since the -2375 value generated by --checkout was refused when writing back the values in the next step.

  3. Finally, I committed the changes with:

    sudo ipmi-sensors-config --commit --filename ipmi-sensors-config.txt
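The edit in step 2 can also be scripted instead of done by hand. The sketch below assumes that the checkout file consists of "Section NAME" ... "EndSection" blocks with one "Key Value" pair per line, and that the section names contain the sensor name; check both against your own file first. The function name disable_sensors and the 'Temp|FAN' pattern are made up for this example. It only touches the two Enable_* settings, so the hysteresis values for the fans still need to be changed by hand.

```shell
# disable_sensors PATTERN
# Reads an ipmi-sensors-config checkout file on stdin and writes a copy to
# stdout in which Enable_All_Event_Messages and Enable_Scanning_On_This_Sensor
# are set to No, but only inside sections whose name matches PATTERN.
# (Hypothetical helper; the Section/EndSection layout is an assumption.)
disable_sensors() {
    awk -v pat="$1" '
        $1 == "Section"    { active = ($2 ~ pat) }
        $1 == "EndSection" { print; active = 0; next }
        active && ($1 == "Enable_All_Event_Messages" ||
                   $1 == "Enable_Scanning_On_This_Sensor") {
            sub(/Yes[ \t]*$/, "No")   # keep indentation and key untouched
        }
        { print }
    '
}

# Possible usage (review the result before committing it):
#   disable_sensors 'Temp|FAN' < ipmi-sensors-config.txt > edited.txt
#   diff ipmi-sensors-config.txt edited.txt
```

Writing to a separate file and diffing it keeps the original checkout around as a backup, which is handy if the BMC refuses some of the values.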
    

I suspect that clearing Enable_All_Event_Messages still allows the sensor to be read, but prevents the thresholds from being checked and events from being generated. This setting seems to just clear the corresponding setting for each available threshold, so it seems you can also disable only some of the thresholds and keep the others. However, it is not entirely clear to me whether this would just keep these events out of the event log, or whether it would actually prevent the system from beeping. When does the system beep? On any event? On specific events? I do not know.

For good measure, I decided to also modify Enable_Scanning_On_This_Sensor, which I believe prevents the sensor from being read at all by the BMC, so that should really prevent alarms. This also causes ipmitool sensor to display value and status as na for these sensors. The sensors command (from the lm-sensors package) can still read the sensor without issues, though the values are not very useful anyway...

Note that apparently these settings are not always persistent across reboots and power cycles, so make sure you test that. For this particular server, the settings survive a reboot, but I have not tested a hard power cycle yet.
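One way to test this after a reboot or power cycle is to check that the disabled sensors still show na as their value in the ipmitool sensor output. This is a minimal sketch, assuming the same pipe-separated layout as shown earlier; the function name still_disabled is invented for this example.

```shell
# still_disabled NAME...
# Reads `ipmitool sensor` style output on stdin. Returns 0 only if every
# named sensor reports "na" as its value, and complains on stderr otherwise.
# (Hypothetical helper; assumes name and value are the first two columns.)
still_disabled() {
    input=$(cat)
    status=0
    for name in "$@"; do
        value=$(printf '%s\n' "$input" | awk -F'|' -v n="$name" '
            { gsub(/^[ \t]+|[ \t]+$/, "", $1); gsub(/[ \t]/, "", $2) }
            $1 == n { print $2; exit }')
        if [ "$value" != "na" ]; then
            echo "$name: expected na, got ${value:-nothing}" >&2
            status=1
        fi
    done
    return "$status"
}

# Possible usage:
#   sudo ipmitool sensor | still_disabled "System Temp" "CPU Temp" "CPU FAN" "SYS FAN"
```

A sensor that is missing from the output entirely is also reported as a failure, so a typo in a sensor name does not pass silently.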

I cannot yet tell for sure if this has fixed the problem (only applied the changes today), but I'm pretty confident that this will indeed keep the people in our Fablab happy (and if not - I'll just solder off the beeper from the motherboard, but let's hope I will not have to resort to such extreme measures...).

 